Required Packages

library(tidyverse)
library(lubridate)
library(ggridges)
library(ggmosaic)
library(XML)
library(rlang)

Intro

A few years ago someone pointed out to me that iPhones record step and distance data automatically and you can check the daily results via the Health app. It’s also no secret that the last few years has featured an increased focus on how leading a sedentary life working at a desk can be bad for your health and people should strive for 10,000 steps per day. If you bother to read the article, it is clear 10,000 steps isn’t a magic or even necessary number but it’s also clear that a lot of people don’t reach the daily step totals they should be. I have put in a big effort to get my steps up and was curious to see what the data had to say about my improvement. This piece of work involves some brief analysis and visualisation of my step data from mid-2015 (21/08/2015 to be precise) to now (last day of 2018).

It’s also worth noting before starting that the step data recorded by the phone couldn’t be entirely accurate as this is obviously not the primary function of the device. Additionally, it would seem reasonable to think the phone would record consistently lower step totals than are actually being taken every day, because I don’t have my phone on me at all times. For this reason it seems wiser to focus on the trends in the data, rather than on the absolute figures.

Data

# data import
health_xml <- xmlParse("export.xml")

# convert the XML to a list
health <- health_xml %>% xmlToList()

# list structure
health %>% str(list.len = 10)
## List of 95133
##  $ ExportDate: Named chr "2019-01-05 10:50:13 +1100"
##   ..- attr(*, "names")= chr "value"
##  $ Me        : Named chr [1:4] "1991-05-30" "HKBiologicalSexMale" "HKBloodTypeNotSet" "HKFitzpatrickSkinTypeNotSet"
##   ..- attr(*, "names")= chr [1:4] "HKCharacteristicTypeIdentifierDateOfBirth" "HKCharacteristicTypeIdentifierBiologicalSex" "HKCharacteristicTypeIdentifierBloodType" "HKCharacteristicTypeIdentifierFitzpatrickSkinType"
##  $ Record    : Named chr [1:8] "HKQuantityTypeIdentifierHeight" "Health" "10.2" "cm" ...
##   ..- attr(*, "names")= chr [1:8] "type" "sourceName" "sourceVersion" "unit" ...
##  $ Record    : Named chr [1:8] "HKQuantityTypeIdentifierBodyMass" "Health" "10.2" "kg" ...
##   ..- attr(*, "names")= chr [1:8] "type" "sourceName" "sourceVersion" "unit" ...
##  $ Record    : Named chr [1:7] "HKQuantityTypeIdentifierStepCount" "iPhone" "count" "2015-08-21 19:43:46 +1100" ...
##   ..- attr(*, "names")= chr [1:7] "type" "sourceName" "unit" "creationDate" ...
##  $ Record    : Named chr [1:7] "HKQuantityTypeIdentifierStepCount" "iPhone" "count" "2015-08-21 19:43:46 +1100" ...
##   ..- attr(*, "names")= chr [1:7] "type" "sourceName" "unit" "creationDate" ...
##  $ Record    : Named chr [1:7] "HKQuantityTypeIdentifierStepCount" "iPhone" "count" "2015-08-21 19:43:46 +1100" ...
##   ..- attr(*, "names")= chr [1:7] "type" "sourceName" "unit" "creationDate" ...
##  $ Record    : Named chr [1:7] "HKQuantityTypeIdentifierStepCount" "iPhone" "count" "2015-08-21 19:43:46 +1100" ...
##   ..- attr(*, "names")= chr [1:7] "type" "sourceName" "unit" "creationDate" ...
##  $ Record    : Named chr [1:7] "HKQuantityTypeIdentifierStepCount" "iPhone" "count" "2015-08-21 19:43:46 +1100" ...
##   ..- attr(*, "names")= chr [1:7] "type" "sourceName" "unit" "creationDate" ...
##  $ Record    : Named chr [1:7] "HKQuantityTypeIdentifierStepCount" "iPhone" "count" "2015-08-21 22:40:41 +1100" ...
##   ..- attr(*, "names")= chr [1:7] "type" "sourceName" "unit" "creationDate" ...
##   [list output truncated]

Digging into the List

At a glance it seems like the list contains the export date and personal information in the first 4 elements then a record of step counts etc in the following elements. It seems a bit odd there are some elements named “Record” which contain information relating to height etc when this is seemingly also the name for elements containing the data related to steps etc recorded by the device. It does seem to be a reasonably neat list though, so extracting the information needed shouldn’t be too tricky.

# confirm the element names 
health %>% 
  names() %>% 
  unique()
## [1] "ExportDate" "Me"         "Record"     ".attrs"
# confirm all elements are character vectors
health %>% 
  map_lgl(is.character) %>% 
  all()
## [1] TRUE

Checking the names confirms the names shown above are the only names for the list elements (which are all character vectors). Now to dig one layer deeper and check the names of each list element.

# check names of the character vector of each element
health %>%
  magrittr::extract(names(health) == "Record") %>% 
  map(names) %>%
  unlist() %>%
  unique()
## [1] "type"          "sourceName"    "sourceVersion" "unit"         
## [5] "creationDate"  "startDate"     "endDate"       "value"        
## [9] "device"

Value, type unit seem to be of interest here - let’s have a look.

# check all types
health %>%
  map("type") %>% 
  unlist() %>% 
  unique() 
## [1] "HKQuantityTypeIdentifierHeight"                
## [2] "HKQuantityTypeIdentifierBodyMass"              
## [3] "HKQuantityTypeIdentifierStepCount"             
## [4] "HKQuantityTypeIdentifierDistanceWalkingRunning"
## [5] "HKQuantityTypeIdentifierFlightsClimbed"
# check all units
health %>%
  map("unit") %>% 
  unlist() %>% 
  unique() 
## [1] "cm"    "kg"    "count" "km"
# check first 10 "values"
health %>%
  map("value") %>% 
  unlist() %>% 
  magrittr::extract(1:10)
##                  ExportDate                      Record 
## "2019-01-05 10:50:13 +1100"                       "180" 
##                      Record                      Record 
##                        "72"                       "414" 
##                      Record                      Record 
##                       "241"                       "187" 
##                      Record                      Record 
##                       "140"                        "74" 
##                      Record                      Record 
##                        "19"                        "94"

It appears the list elements contain, along with a bunch of other information, a type that identifies what the record’s data is related to, a unit measure for the data recorded and, of course, a value. Before proceeding, I’m just going to double-check everything is at it seems by inspecting 2 elements of interest.

# steps check
health %>% magrittr::extract(map_lgl(., ~ "HKQuantityTypeIdentifierStepCount" %in% .)) %>% 
  magrittr::extract(1)
## $Record
##                                type                          sourceName 
## "HKQuantityTypeIdentifierStepCount"                            "iPhone" 
##                                unit                        creationDate 
##                             "count"         "2015-08-21 19:43:46 +1100" 
##                           startDate                             endDate 
##         "2015-08-21 18:29:10 +1100"         "2015-08-21 18:34:15 +1100" 
##                               value 
##                               "414"
# distance check
health %>% magrittr::extract(map_lgl(., ~ "HKQuantityTypeIdentifierDistanceWalkingRunning" %in% .)) %>% 
  magrittr::extract(1)
## $Record
##                                             type 
## "HKQuantityTypeIdentifierDistanceWalkingRunning" 
##                                       sourceName 
##                                         "iPhone" 
##                                             unit 
##                                             "km" 
##                                     creationDate 
##                      "2015-08-21 19:43:46 +1100" 
##                                        startDate 
##                      "2015-08-21 18:29:10 +1100" 
##                                          endDate 
##                      "2015-08-21 18:34:15 +1100" 
##                                            value 
##                                        "0.31032"

Seems all good.

Inspecting the list elements also gives a bit more insight into the date fields. It appears “startDate” refers to when the activity begun, “endDate” when it finished and “creationDate” perhaps when it is recorded by the device. It doesn’t really matter anyway, as I am going to use “endDate” as the timestamp for my data as it seems to make the most sense - it indicates the conclusion of the activity being recorded.

Data Preprocessing

Now the list is better understood, the data can be put into tidy data frames for analysis and plots. To begin I’m going to extract the step and distance into separate “raw” format data frames where every row is for an individual timestamp produced by the “endDate” field. It might make more sense to put all of the step and distance data into the one data frame but for some unknown reason the two metrics don’t have matching timestamps, which makes things tricky.

# create vector for "types"
types <- health %>% 
  map("type") %>% 
  unlist() %>% 
  unique()

# create index for steps
steps_index <- health %>% 
  map_lgl(~ types[3] %in% .)

# extract all list elements for step data
steps <- health[steps_index]

# extract all step count data
step_counts <- steps %>% 
  map("value") %>% 
  as.integer() %>% 
  unname()

# extract all timestamps and convert to melb datetime
steps_time_stamps <- steps %>%  
  map_chr("endDate") %>% 
  unname() %>% 
  ymd_hms(tz = "Australia/Melbourne")

# create data frame of steps data
steps_df <- tibble(
  time_stamp = steps_time_stamps,
  date = time_stamp %>% as_date(),
  step_counts = step_counts
)

steps_df
## # A tibble: 42,564 x 3
##    time_stamp          date       step_counts
##    <dttm>              <date>           <int>
##  1 2015-08-21 17:34:15 2015-08-21         414
##  2 2015-08-21 17:40:02 2015-08-21         241
##  3 2015-08-21 17:45:12 2015-08-21         187
##  4 2015-08-21 17:49:11 2015-08-21         140
##  5 2015-08-21 17:58:41 2015-08-21          74
##  6 2015-08-21 20:38:03 2015-08-21          19
##  7 2015-08-21 21:02:34 2015-08-21          94
##  8 2015-08-21 21:07:44 2015-08-21          42
##  9 2015-08-21 21:07:54 2015-08-21           4
## 10 2015-08-21 21:33:11 2015-08-21          17
## # … with 42,554 more rows
# create index for distances
distances_index <- health %>% 
  map_lgl(~ types[4] %in% .)

# extract all list elements for distance data
distances <- health[distances_index]

# extract all distance covered data
distance_covered <- distances %>% 
  map("value") %>%
  as.numeric() %>% 
  unname()

# extract all timestamps and convert to melb datetime
distances_time_stamps <- distances %>% 
  map_chr("endDate") %>% 
  unname() %>% 
  ymd_hms(tz = "Australia/Melbourne")

# create data frame of distances data
distances_df <- tibble(
  time_stamp = distances_time_stamps,
  date = time_stamp %>% as_date(),
  distance_covered = distance_covered
)

distances_df
## # A tibble: 45,956 x 3
##    time_stamp          date       distance_covered
##    <dttm>              <date>                <dbl>
##  1 2015-08-21 17:34:15 2015-08-21          0.310  
##  2 2015-08-21 17:40:02 2015-08-21          0.183  
##  3 2015-08-21 17:45:12 2015-08-21          0.152  
##  4 2015-08-21 17:49:11 2015-08-21          0.101  
##  5 2015-08-21 17:57:41 2015-08-21          0.002  
##  6 2015-08-21 17:58:41 2015-08-21          0.0399 
##  7 2015-08-21 20:38:03 2015-08-21          0.00969
##  8 2015-08-21 20:58:22 2015-08-21          0.00969
##  9 2015-08-21 21:02:34 2015-08-21          0.0604 
## 10 2015-08-21 21:07:44 2015-08-21          0.0504 
## # … with 45,946 more rows

Now to collapse these data frames so each row is a single day and add some additional variables based on the timestamps.

# create summarised steps data frame
steps_summary <- steps_df %>% 
  group_by(date) %>% 
  summarise(total_steps = step_counts %>% sum()) %>% 
  mutate(year = date %>% year() %>% factor(),
         month = date %>% month() %>% factor(),
         day_of_week = date %>% wday(label = T, week_start = 1) %>% factor(ordered = F), 
         weekday_weekend = if_else(day_of_week %in% c("Sat", "Sun"), "Weekend", "Week Day") %>% factor())

steps_summary
## # A tibble: 1,223 x 6
##    date       total_steps year  month day_of_week weekday_weekend
##    <date>           <int> <fct> <fct> <fct>       <fct>          
##  1 2015-08-21        1242 2015  8     Fri         Week Day       
##  2 2015-08-22        7836 2015  8     Sat         Weekend        
##  3 2015-08-23        4416 2015  8     Sun         Weekend        
##  4 2015-08-24        6700 2015  8     Mon         Week Day       
##  5 2015-08-25        5862 2015  8     Tue         Week Day       
##  6 2015-08-26        5432 2015  8     Wed         Week Day       
##  7 2015-08-27        4368 2015  8     Thu         Week Day       
##  8 2015-08-28       10167 2015  8     Fri         Week Day       
##  9 2015-08-29        5388 2015  8     Sat         Weekend        
## 10 2015-08-30        5127 2015  8     Sun         Weekend        
## # … with 1,213 more rows
# create summarised distances data frame
distances_summary <- distances_df %>% 
  group_by(date) %>% 
  summarise(total_distance = distance_covered %>% sum()) %>% 
  mutate(year = date %>% year() %>% factor(),
         month = date %>% month() %>% factor(),
         day_of_week = date %>% wday(label = T, week_start = 1) %>% factor(ordered = F), 
         weekday_weekend = if_else(day_of_week %in% c("Sat", "Sun"), "Weekend", "Week Day") %>% factor())

distances_summary
## # A tibble: 1,223 x 6
##    date       total_distance year  month day_of_week weekday_weekend
##    <date>              <dbl> <fct> <fct> <fct>       <fct>          
##  1 2015-08-21          0.942 2015  8     Fri         Week Day       
##  2 2015-08-22          5.46  2015  8     Sat         Weekend        
##  3 2015-08-23          3.31  2015  8     Sun         Weekend        
##  4 2015-08-24          5.04  2015  8     Mon         Week Day       
##  5 2015-08-25          4.49  2015  8     Tue         Week Day       
##  6 2015-08-26          4.18  2015  8     Wed         Week Day       
##  7 2015-08-27          3.29  2015  8     Thu         Week Day       
##  8 2015-08-28          7.56  2015  8     Fri         Week Day       
##  9 2015-08-29          4.33  2015  8     Sat         Weekend        
## 10 2015-08-30          3.93  2015  8     Sun         Weekend        
## # … with 1,213 more rows

The remainder of the data preprocessing will focus on the step data as it will be the primary data used for analysis (the distances data will only be used for a single plot, so some potential minor data quality issues are fine).

As a data quality check, let’s confirm all the dates are there and all the step totals appear valid.

# date check
steps_summary %>% 
  group_by(year) %>% 
  count()
## # A tibble: 5 x 2
## # Groups:   year [5]
##   year      n
##   <fct> <int>
## 1 2015    133
## 2 2016    355
## 3 2017    365
## 4 2018    365
## 5 2019      5
# check all dates have valid data
steps_summary %>% filter(total_steps <= 0)
## # A tibble: 0 x 6
## # … with 6 variables: date <date>, total_steps <int>, year <fct>,
## #   month <fct>, day_of_week <fct>, weekday_weekend <fct>
# check no totals are too high
steps_summary %>% arrange(desc(total_steps))
## # A tibble: 1,223 x 6
##    date       total_steps year  month day_of_week weekday_weekend
##    <date>           <int> <fct> <fct> <fct>       <fct>          
##  1 2017-05-26       19056 2017  5     Fri         Week Day       
##  2 2017-05-27       17757 2017  5     Sat         Weekend        
##  3 2016-02-21       16166 2016  2     Sun         Weekend        
##  4 2016-11-25       15791 2016  11    Fri         Week Day       
##  5 2017-05-20       15599 2017  5     Sat         Weekend        
##  6 2017-09-28       15496 2017  9     Thu         Week Day       
##  7 2017-09-30       15460 2017  9     Sat         Weekend        
##  8 2016-08-05       15391 2016  8     Fri         Week Day       
##  9 2015-12-28       15363 2015  12    Mon         Week Day       
## 10 2015-09-04       15244 2015  9     Fri         Week Day       
## # … with 1,213 more rows

Step totals appear fine, but there do seem to be some missing dates. After exporting the data a few times over the course of this project I’ve noticed that for some reason it never quite works properly as some dates are always missing. To investigate, I’m first going to put all the missing dates into a data frame and inspect it.

# create for all dates across period data was recorded
all_dates <- seq.Date(steps_summary$date[1], steps_summary$date[nrow(steps_summary)], by = 1) %>% 
  enframe(name = NULL) %>% 
  rename(date = value)

# extract missing dates
missing_dates <- all_dates %>% 
  anti_join(steps_summary, by = "date")

missing_dates
## # A tibble: 11 x 1
##    date      
##    <date>    
##  1 2016-05-28
##  2 2016-06-01
##  3 2016-06-21
##  4 2016-09-28
##  5 2016-09-29
##  6 2016-09-30
##  7 2016-10-01
##  8 2016-10-02
##  9 2016-10-03
## 10 2016-10-04
## 11 2016-10-05

Next let’s add in the days before and after the missing dates to see if they have some peculiar data.

# add days 
missing_dates <- missing_dates %>% 
  mutate(day_before = date - days(1), 
         day_after = date + days(1))

# check days before & after
steps_summary %>% 
  filter(date %in% missing_dates$day_before | date %in% missing_dates$day_after)
## # A tibble: 8 x 6
##   date       total_steps year  month day_of_week weekday_weekend
##   <date>           <int> <fct> <fct> <fct>       <fct>          
## 1 2016-05-27        8917 2016  5     Fri         Week Day       
## 2 2016-05-29          33 2016  5     Sun         Weekend        
## 3 2016-05-31        5078 2016  5     Tue         Week Day       
## 4 2016-06-02          80 2016  6     Thu         Week Day       
## 5 2016-06-20         128 2016  6     Mon         Week Day       
## 6 2016-06-22        2448 2016  6     Wed         Week Day       
## 7 2016-09-27        2609 2016  9     Tue         Week Day       
## 8 2016-10-06        6988 2016  10    Thu         Week Day

It seems the days before and after the missing dates have some odd step counts. May 29th for example has a total of 33 steps, which is way too low to be correct. As a quick-and-dirty fix I’m going to impute these values with the mean (including the data points that don’t appear to be that odd, just to be safe). If this data was being used for a model or more sophisticated analysis this mean imputation could be problematic but for simple analysis it is fine, particularly considering there are so few data points being altered.

For the dates missing altogether the same technique of mean imputation will be used to make the data complete. The mean will be slightly different though as it is calculated with the altered data for the before / after missing dates days, but again this isn’t an issue as I’m going to change the values again in the next preprocessing phase.

# replace total_steps data for days before and after missing dates with mean
steps_summary <- steps_summary %>% 
  mutate(
    total_steps = if_else(
      date %in% (missing_dates %>% pull(day_before)) | date %in% (missing_dates %>% pull(day_after)),
      steps_summary$total_steps %>% mean() %>% as.integer(),
      total_steps
    )
  ) 

# create data frame for missing dates with mean for total_steps
missing_date_data <- missing_dates %>% 
  select(date) %>% 
  mutate(total_steps = steps_summary$total_steps %>% mean() %>% as.integer(),
         year = date %>% year() %>% factor(levels = levels(steps_summary$year)),
         month = date %>% month() %>% factor(levels = 1:12),
         day_of_week = date %>% wday(label = T, week_start = 1) %>% factor(ordered = F),
         weekday_weekend = if_else(day_of_week %in% c("Sat", "Sun"), "Weekend", "Week Day") %>% factor())

# append missing date and re-order data frame based on date
steps_summary <- steps_summary %>% 
  bind_rows(missing_date_data) %>% 
  arrange(date)
# check data is now complete
steps_summary %>% 
  group_by(year) %>% 
  count()
## # A tibble: 5 x 2
## # Groups:   year [5]
##   year      n
##   <fct> <int>
## 1 2015    133
## 2 2016    366
## 3 2017    365
## 4 2018    365
## 5 2019      5

366 is in fact correct for 2016 as it was a leap year.

Before proceeding with the analysis I am going to quickly apply a change to the data points subject the mean imputation as having the same value repeated often messes up some of the charts to follow.

The code below updates the values by adding or subtracting a value within one +/- standard deviation of step count data. This should create realistic values that aren’t duplicated. Again, not necessarily something you would recommend feeding into a model but for this analysis and for very few data points it is sufficient.

# create an index of days to update
update_index <- steps_summary$date %>% 
  map_lgl(~ . %in% missing_dates$date | . %in% missing_dates$day_before | . %in% missing_dates$day_after) 

# create a vector of updated values
update_values <- steps_summary$total_steps[update_index] %>% 
  map_int(
    ~ (. + sample(-sd(steps_summary$total_steps):sd(steps_summary$total_steps), 1)) %>% 
      as.integer()
  ) 

# update values
steps_summary <- steps_summary %>% 
  mutate(total_steps = replace(total_steps, update_index, update_values))

# data - ready to use
steps_summary
## # A tibble: 1,234 x 6
##    date       total_steps year  month day_of_week weekday_weekend
##    <date>           <int> <fct> <fct> <fct>       <fct>          
##  1 2015-08-21        1242 2015  8     Fri         Week Day       
##  2 2015-08-22        7836 2015  8     Sat         Weekend        
##  3 2015-08-23        4416 2015  8     Sun         Weekend        
##  4 2015-08-24        6700 2015  8     Mon         Week Day       
##  5 2015-08-25        5862 2015  8     Tue         Week Day       
##  6 2015-08-26        5432 2015  8     Wed         Week Day       
##  7 2015-08-27        4368 2015  8     Thu         Week Day       
##  8 2015-08-28       10167 2015  8     Fri         Week Day       
##  9 2015-08-29        5388 2015  8     Sat         Weekend        
## 10 2015-08-30        5127 2015  8     Sun         Weekend        
## # … with 1,224 more rows

Now the data is ready for analysis!

The data frame contains the date, total steps and then 4 factor variables extracted from the date.

As a final action before analysis, all data for 2019 will be removed as it only has ~7 days old.

# remove 2019 data
steps_summary <- steps_summary %>% 
  filter(year != 2019) %>% 
  mutate(year = year %>% fct_drop())

Data Summary

Let’s kick of with some basic summary stats by each of the categorical variables

# remove scientific notation
options(scipen = 999)

# mean
steps_summary$total_steps %>% mean()
## [1] 6500.186
# standard deviation
steps_summary$total_steps %>% sd()
## [1] 2999.614

My daily average is almost 6,500. Not quite the “magic” 10,000, but not bad. The standard deviation of ~3,000 shows there is considerable variation in the data.

Now for the same statistics, but by each factor variable.

# create summary stats function
summary_fun <- function(variable) {
  quo_variable <- enquo(variable)
  
  steps_summary %>% 
    group_by(!!quo_variable) %>% 
    summarise(mean = total_steps %>% mean(),
              sd = total_steps %>% sd()) 
}

# extract variable names
var_names <- colnames(steps_summary %>% select(year:weekday_weekend)) %>% 
  syms()

# summary stats
var_summary <- var_names %>% 
  map(summary_fun) 

var_summary
## [[1]]
## # A tibble: 4 x 3
##   year   mean    sd
##   <fct> <dbl> <dbl>
## 1 2015  5894. 2728.
## 2 2016  6042. 2794.
## 3 2017  6670. 3275.
## 4 2018  7011. 2909.
## 
## [[2]]
## # A tibble: 12 x 3
##    month  mean    sd
##    <fct> <dbl> <dbl>
##  1 1     6328. 2851.
##  2 2     7129. 2888.
##  3 3     6526. 2649.
##  4 4     6092. 2740.
##  5 5     6449. 3406.
##  6 6     6137. 3129.
##  7 7     6463. 2887.
##  8 8     6732. 3211.
##  9 9     6300. 3181.
## 10 10    6317. 2428.
## 11 11    6781. 3085.
## 12 12    6715. 3333.
## 
## [[3]]
## # A tibble: 7 x 3
##   day_of_week  mean    sd
##   <fct>       <dbl> <dbl>
## 1 Mon         6624. 2417.
## 2 Tue         5690. 2434.
## 3 Wed         6092. 2369.
## 4 Thu         6378. 2596.
## 5 Fri         8871. 2943.
## 6 Sat         7393. 3277.
## 7 Sun         4446. 2860.
## 
## [[4]]
## # A tibble: 2 x 3
##   weekday_weekend  mean    sd
##   <fct>           <dbl> <dbl>
## 1 Week Day        6733. 2788.
## 2 Weekend         5920. 3407.

From 2015 to 2018 there was an improvement each year. Looking at the day of week data shows Sunday is my laziest day while Friday stands out as a day when I seem to get a lot more steps than others.

Plots

The summary stats tables are nice but plots will give a lot more insight into the data and allow for easy exploration. I’ve split the plots into basic summaries, by year, by month, by day of week and then a random assortment at the end.

Summary Plots

# create plot function
bar_plot <- function(variable) {
  quo_variable <- enquo(variable)
  quo_text <- quo_text(quo_variable) %>% 
    str_replace_all("_", " ") %>% 
    tools::toTitleCase()
  
  steps_summary %>% 
    group_by(!!quo_variable) %>% 
    summarise(mean_steps = total_steps %>% mean()) %>% 
    ggplot(aes(!!quo_variable, mean_steps)) +
    geom_col(position = "dodge") +
    labs(title = paste("Mean Steps by", quo_text),
         x = quo_text,
         y = "Total Steps") +
    theme(plot.title = element_text(hjust = 0.5))
}

# summary plots
var_names %>% map(bar_plot)
## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

Years

Starting off with a histogram featuring all the data, with a line for the mean (the mean line can have a legend in ggplot2 as you can see in my footy tipping project, but I find it cumbersome to add).

title_adjust <- theme(plot.title = element_text(hjust = 0.5))

steps_summary %>% 
  ggplot(aes(total_steps)) +
  geom_histogram(binwidth = 200) +
  geom_vline(xintercept = mean(steps_summary$total_steps), colour = "dodgerblue", size = 2) +
  labs(title = "Total Steps Histogram",
       x = "Total Steps",
       y = "Count") +
  title_adjust

Let’s split by year to see how the shape of histogram has changed over time. The blue line now indicates the overall mean (as seen on the previous chart) and the red lines are the mean for each year.

steps_summary %>%
  ggplot(aes(total_steps)) +
  geom_histogram(binwidth = 200) +
  geom_vline(xintercept = mean(steps_summary$total_steps), colour = "dodgerblue", size = 2) +
  facet_grid(year ~ .) +
  geom_vline(data = var_summary[[1]], aes(xintercept = mean), colour = "red", size = 2) +
  labs(title = "Total Steps Histograms by Year",
       x = "Total Steps",
       y = "Count") +
  title_adjust

The trend is clear, back in 2015 & 2016 a lot of my step totals were < 5,000 but now this mark is consistently eclipsed, particularly in 2018. The histograms also clearly show the data for 2015 is incomplete.

Now, for a final iteration of the histogram featuring a colour fill for the week day-weekend variable. The overall mean is now indicated with a black line.

steps_summary %>%
  ggplot(aes(total_steps, fill = weekday_weekend)) +
  geom_histogram(binwidth = 200) +
  geom_vline(xintercept = mean(steps_summary$total_steps), size = 2) +
  scale_fill_brewer(palette = "Dark2") +
  facet_grid(year ~ .) +
  labs(title = "Total Steps Histogram by Year",
       x = "Total Steps",
       y = "Count",
       fill = "") +
  title_adjust

It’s clear that, especially in 2018, weekends make up the majority of the lowest totals. I do have some lazy Sundays, and the data backs that up.

It also seems like over the period 2015 to 2018 most of my improvement has been on week days, rather than weekends. A simple line chart can help confirm this.

steps_summary %>% 
  group_by(year, weekday_weekend) %>% 
  summarise(mean = mean(total_steps)) %>% 
  ggplot(aes(year, mean, group = weekday_weekend, colour = weekday_weekend)) +
  geom_line(size = 2) +
  scale_colour_brewer(palette = "Dark2") +
  scale_y_continuous(limits = c(0, 8000)) +
  labs(title = "Trended Mean Total Steps",
       x = "Year",
       y = "Mean",
       colour = "") +
  title_adjust

It’s clear that all the improvement I’ve seen in my average daily step counts has been achieved through improvements on week days.

Months

Again, going with histograms but this time split by month with the overall mean shown with the dotted line. Using geom_density_ridges from the ggridges packages allows for slightly more compact histograms than a ggplot2 facet.

month_ridge_1 <- steps_summary %>% 
  filter(month %in% 1:6) %>% 
  ggplot(aes(total_steps, fct_rev(month))) +
  geom_density_ridges(stat = "binline", bins = 50, size = 0.2, scale = 1, fill = "seagreen1") +
  geom_vline(xintercept = mean(steps_summary$total_steps), linetype = "dotted", size = 1) +
  labs(title = "January - June Total Steps Histograms",
       x = "Total Steps",
       y = "Month") +
  title_adjust


month_ridge_2 <- steps_summary %>% 
  filter(month %in% 7:12) %>% 
  ggplot(aes(total_steps, fct_rev(month))) +
  geom_density_ridges(stat = "binline", bins = 50, size = 0.2, scale = 1, fill = "seagreen1") +
  geom_vline(xintercept = mean(steps_summary$total_steps), linetype = "dotted", size = 1) +
  labs(title = "July - December Total Steps Histograms",
       x = "Total Steps",
       y = "Month") +
  title_adjust

cowplot::plot_grid(month_ridge_1, month_ridge_2)

Unlike the data for the years, there isn’t any patterns jumping out. This makes sense though, as my schedule is fairly consistent month-to-month even with uni etc.

Combining year and month in plot might yield some information though. Swapping to a box as it is well-suited to this approach.

steps_summary %>% 
  filter(year != 2015) %>% # remove 2015 as it is incomplete
  ggplot(aes(month, total_steps, fill = year)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Total Steps Box Plots by Month",
       x = "Month",
       y = "Total Steps",
       fill = "Year") +
  title_adjust

This gives some better insight. I’m not going to try and unpack everything on the box plot but one thing that stands out to me is that the interquartile range really grows throughout 2016 and then for 2017 & 2018 it is seemingly more steady with some random noise. This chart also reinforces the previous histograms, with the median typically rising for each year from 2016 to 2018 (this isn’t always the case, though).

Day of Week

Again, beginning with a histogram featuring a dotted line for the overall mean.

steps_summary %>% 
  ggplot(aes(total_steps, fct_rev(day_of_week))) +
  geom_density_ridges(stat = "binline", bins = 80, size = 0.2, scale = 1, fill = "lightcoral") +
  geom_vline(xintercept = mean(steps_summary$total_steps), linetype = "dotted", size = 1) +
  labs(title = "Total Steps Histograms by Day of Week",
       x = "Total Steps",
       y = "Day of Week") +
  title_adjust

Monday - Thursday is relatively consistent in terms of spread of data and the number of points above and below the mean. Friday - Sunday is a different story though. Friday has the majority of its data above the mean, Saturday is the most spread of all days and then Sunday features a lot of data points well below the mean.

I’m again going to go back to the box plot with year again to get some extra information out of this variable.

steps_summary %>% 
  filter(year != 2015) %>% # remove incomplete year
  ggplot(aes(day_of_week, total_steps, fill = year)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Total Steps Box Plots by Day of Week",
       x = "Day of Week",
       y = "Total Steps",
       fill = "Year") +
  title_adjust

The box plot is again very informative when layered with the year variable. Friday has always been a consistently high day, even back in 2016. Tuesday and Wednesday were big improvers for 2018 vs 2017, while Monday and Sunday fell away.

Some of Everything

Now to show off some of the other cool things ggplot2 and its many extenstions can do and put together some other charts this kind of data is well-suited to.

The first plot involves probably the most simple plotting of this data: a line plot with total steps by day. I’ll also add in a 30-day rolling mean and use ggforce to zoom in on part of the plot.

# create rolling mean vector
rolling_mean <- RcppRoll::roll_mean(steps_summary$total_steps, n = 30)

# calculate end of vector to be filled with NA values 
# rolling 30 day mean cannot be calculated for final 29 days
diff <- nrow(steps_summary) - length(rolling_mean)

# fill in vector
rolling_mean <- c(rolling_mean, rep(NA, diff))
names(rolling_mean) <- steps_summary$date

steps_summary %>% 
  ggplot(aes(date, total_steps)) +
  geom_line() +
  geom_line(data = steps_summary %>% mutate(y = rolling_mean),
            aes(date, y),
            colour = "dodgerblue",
            size = 2) +
  ggforce::facet_zoom(date %in% seq.Date(as_date("2018-01-01"), as_date("2018-06-30"), by = 1)) +
  scale_y_continuous(breaks = seq(0, 20000, by = 1000)) +
  labs(title = "Total Steps by Day with Rolling 30 Day Mean",
       x = "Day",
       y = "Total Steps") +
  title_adjust

One thing I’ve also been curious about was how well the total steps correlates with distance covered. I’ve always assumed days where I went for a run resulted in higher distance but not necessarily as many steps as if I’ve covered the same amount of ground walking. Creating a scatter plot of total steps vs total distance is hardly impressive but ggplot2 has the alpha aesthetic which allows us to see more common data points and the line of best can be fitted easily and the simple linear regression equation and \(R^2\) can be added to the plot with the help of ggpmisc.

steps_summary %>% 
  inner_join(distances_summary, by = "date") %>%
  # remove days with data imputed
  filter(!(date %in% missing_dates$date | date %in% missing_dates$day_before | date %in% missing_dates$day_after)) %>% 
  ggplot(aes(total_steps, total_distance)) +
  geom_point(size = 2, alpha = 0.25) +
  geom_smooth(method = lm, se = F, colour = "firebrick1") +
  ggpmisc::stat_poly_eq(formula = y ~ x, 
                        aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")), 
                        parse = T) +
  labs(title = "Total Steps vs Total Distance",
       x = "Total Steps",
       y = "Total Distance") +
  title_adjust

Obviously, the correlation is very high but there is still some variation. After ~5,000 steps I sometimes have covered ~3km but other times the distance is almost ~5km, a difference of ~65%.

Next is one of my favourite plots: the heatmap. These can be done easily with ggplot2 like below.

# create cross tab data frame
year_day_xtab <- steps_summary %>% 
  xtabs(total_steps ~ year + day_of_week, data = .) %>% 
  as_tibble() %>% 
  filter(year != 2015) %>% # remove incomplete year
  mutate(year = year %>% factor(),
         day_of_week = day_of_week %>% factor(levels = steps_summary$day_of_week %>% levels()))

year_day_xtab %>% 
  ggplot(aes(year, fct_rev(day_of_week), fill = n)) +
  geom_raster() +
  scale_fill_gradient(low = "white", 
                      high = "violetred4",
                      limits = c(0, max(year_day_xtab$n))) +
  labs(title = "Total Steps by Year & Day of Week Heatmap",
       x = "Year",
       y = "Day of Week",
       fill = "Total Steps") +
  guides(fill = guide_colourbar(ticks = F)) +
  title_adjust

They can also be built to add things like a calendar view with ggTimeSeries.

steps_summary %>% 
  filter(year != 2015) %>% # remove incomplete year
  ggTimeSeries::ggplot_calendar_heatmap("date", "total_steps") +
  facet_wrap(~ Year, ncol = 1) +
  scale_fill_continuous(low = "green", high = "red") +
  labs(title = "Total Steps by Year & Individual Day Heatmap",
       x = "Month",
       y = "Day of Week",
       fill = "Total Steps") +
  guides(fill = guide_colourbar(ticks = F)) +
  title_adjust

While a plot like this would be best-suited to seasonal data, it does do a good job of showing the improvement from 2016 to 2018 in terms of steps.

Next up is the jitter plot. A jitter plot is a point plot that adds some random variation to each point to avoid over-plotting that would occur if a simple point plot was used for the below.

steps_summary %>%
  filter(year != 2015) %>% 
  ggplot(aes(day_of_week, total_steps, colour = day_of_week)) +
  geom_jitter(width = 0.2) +
  facet_wrap(~ year) +
  labs(title = "Total Steps by Year & Day of Week Jitter Plot",
       x = "Day of Week",
       y = "Total Steps") +
  guides(colour = F) +
  title_adjust

This plot really shows how much I avoided low step days on Tuesdays and Wednesdays in 2018 but really piled up low totals on Sundays.

The final plot is a mosaic plot, another plot type I’m a big fan of. This essentially gives a similar insight to the 1st heatmap, showing the totals for the cross-tab between 2 categorical variables (in this case year and day of week).

year_day_xtab %>% 
  ggplot() +
  geom_mosaic(aes(x = product(day_of_week), weight = n, fill = year)) +
  labs(title = "Total Steps by Year & Day of Week Mosaic Plot",
       x = "Day of Week",
       y = "Year") +
  guides(fill = F) +
  title_adjust

Hopefully this has shown off some of the cool things R and ggplot2 can do with such a simple dataset like this.